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Abstract 

In  this  paper  we  describe  a  new  algorithm  for  maintaining  a  balanced  search  tree  on  a  message-passing 
MIMD  architecture;  the  algorithm  is  particularly  well  suited  for  implementation  on  a  small  number  of 
processors.  We  introduce  a  (2B~2,2B)  search  tree  that  uses  a  linear  array  of  0(log  n)  processors  to 
store  n  entries.  Update  operations  use  a  bottom-up  node-splitting  scheme,  which  performs  better  than 
top-down  search  tree  algorithms.  Additionally,  for  a  given  cost  ratio  of  computation  to  communication 
the  value  of  B  may  be  varied  to  maximize  performance.  Implementations  on  a  parallel-architecture 
simulator  are  described. 
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1  INTRODUCTION 


1  Introduction 

We  introduce  a  new  balanced  search  tree  algorithm  for  message-passing  architectures.  The  algorithm 
assumes  a  linear  array  of  processors  each  with  a  large  local  memory;  such  arrays  are  easily  emulated  on 
most  message-passing  MIMD  architectures.  The  algorithm  performs  updates  using  a  bottom-up  node¬ 
splitting  scheme  and  shows  improvements  in  throughput  and  response  time  when  compared  to  similar 
top-down  algorithms  [GS78,  CT84]. 

Search  trees  are  widely  used  for  fast  implementations  of  dictionary  abstract  data  types.  A  dictionary 
is  a  partial  mapping  from  keys  to  data  that  supports  three  operations:  insert,  delete  and  search.  For 
simplicity  we  will  assume  that  the  dictionary  stores  no  data  with  the  keys  and  so  may  be  viewed  as  a 
set  of  keys.  A  number  of  useful  computations  can  be  implemented  in  terms  of  dictionary  abstract  data 
types,  including  symbol  tables,  priority  queues  and  pattern-matching  systems. 

The  B-tree  was  originally  introduced  by  Bayer  [Bay72].  The  B-tree  algorithms  for  sequential  ap¬ 
plications  were  designed  to  minimize  the  response  time  for  a  single  query  and  the  sequential  algorithm 
for  a  single  search  operation  on  a  balanced  B-tree  has  logarithmic  complexity.  The  improvement  in  the 
response  time  that  can  be  achieved  by  a  parallel  algorithm  for  a  single  search  can  at  best  be  logarithmic 
in  the  number  of  processors  [Qui87].  Thus,  for  parallel  systems  a  more  important  concern  is  the  system 
throughput  for  a  series  of  search,  insertion  and  deletion  operations  executing  in  parallel. 

We  introduce  the  (2B~2 ,2B)1  search  tree,  a  variation  of  the  B-tree.  A  (2B_2,2B)  search  tree  (for 
B  >  3)  is  a  tree  in  which  every  branch  node  (except  the  root)  has  between  2s-2  and  2s  children,  and 
every  path  from  the  root  to  a  leaf  has  the  same  length.  For  example,  B=8  gives  a  (2,8)  tree,  B-J,  gives 
a  (4, 16)  tree  and  so  on.  The  root  has  between  2  and  2B  children.  Only  leaf  nodes  store  key  values; 
branch  nodes  store  index  information  used  to  find  the  appropriate  leaf  node.  A  leaf  node  stores  between 
2b~2  and  2B  keys.  Each  leaf  node  stores  key  values  within  a  contiguous  range;  the  ranges  of  all  leaf 
nodes  partition  the  set  of  possible  key  values  and  are  in  ascending  order  from  the  leftmost  node  to  the 
rightmost.  The  index  information  stored  at  a  non-leaf  node  is  simply  the  lowest  key  value  associated 
with  each  of  its  children.  Thus  the  set  of  key  values  is  partitioned  at  every  level  in  the  tree. 

A  search  tree  of  n  entries  is  implemented  on  an  array  of  up  to  log2n/(B  -  2)  +  1  processors.  Each 
processor  holds  a  level  of  the  tree  structure  in  local  memory  and  the  last  processor  stores  the  actual  keys. 
Therefore,  the  memory  required  to  store  the  search  tree  increases  by  a  factor  of  2°  between  adjacent 
processors  down  the  linear  array.  Figure  1  shows  this  configuration  for  a  (2,8)  tree.  Processor  Pi  stores 
the  leaf  nodes  of  the  tree.  Operations  are  invoked  by  requests  entering  the  array  at  the  top;  the  replies 
these  gen  "  '*  leave  from  the  bottom.  Each  processor  communicates  only  with  its  immediate  neighbors. 


'This  is  pronounced,  “two  to  the  B  minus  two,  two  to  the  B”;  or  simply  “two  B" . 
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Requests 


Replies 


leaf  nodes 


Figure  1:  The  processor  configuration  for  a  (2,8)  tree  of  height  H. 


In  our  new  algorithm  a  dictionary  is  represented  by  a  modified  {2B~2 ,2B)  search  tree  in  which  links 
are  added  from  each  node  to  its  left  and  right  neighbors  in  the  same  level.  This  is  similar  to  the  algorithm 
developed  by  Weihl  and  Wang  [WW90,  Wan91]  for  B-trees,  which  in  turn  was  based  on  an  algorithm 
developed  by  Lehman  and  Yao  [LY81]  and  modified  by  Sagiv  [Sag86].  In  these  link  method  algorithms 
[SG88]  right  links  are  added  to  adjacent  nodes  in  a  B-tree.  In  the  new  algorithm  both  left  and  right 
links  are  added  between  adjacent  siblings. 

Carey  and  Thompson  [CT84]  implemented  a  2-3-4  search  tree  using  a  linear  array  of  O(logn)  pro¬ 
cessors.  A  2-3-4  free  is  a  tree  in  which  each  node  that  is  not  a  leaf  has  two,  three  or  four  children, 
and  every  path  from  the  root  to  a  leaf  is  the  same  length.  The  scheme  allows  update  operations  to  be 
performed  using  the  top-down  node-splitting  scheme  presented  by  Guibas  and  Sedgewick  [GS78].  Mond 
and  Raz  [MR85]  als^  proposed  a  top-down  strategy  for  B-trees.  A  linear  array  of  processors  was  used 
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2  CONCURRENT  SEARCH  TREE  ALGORITHMS 


by  Tanaka,  Nozakaand  Masuyama  [TNM80]  in  their  pipelined  binary-tree  algorithm,  and  Fisher  [Fis84] 
proposed  a  pipeline  system  that  used  a  pipeline  length  proportional  to  the  length  of  the  key. 

We  first  implemented  a  (2S-2, 2s)  search  tree  using  a  top-down  node-splitting  scheme.  We  then 
implemented  the  search  tree  using  the  bottom-up  algorithm.  Both  of  these  algorithms  have  been  im¬ 
plemented  using  Proteus  [Del91,  Bre91,  BDCW91],  a  multiprocessor  simulator  developed  at  MIT.  We 
measured  the  throughput  and  response  times  for  various  processor-array  lengths,  tree-branching  factors, 
query  mixes,  message-p?.?«ing  paradigms  and  one-way  message  delays.  From  these  measurements  we 
compared  the  performance  of  the  algorithms  and  determined  the  optimal  tree  structures.  The  bottom- 
up  algorithm  has  better  throughput  and  response  time  than  the  top-down  algorithm  in  every  case. 

In  Section  2  we  present  the  issues  that  must  be  addressed  by  concurrent  search  tree  algorithms  and 
motivate  the  development  of  the  new  algorithm.  Sections  3  and  4  describe  the  top-down  and  bottom-up 
algorithms.  Section  5  describes  the  Proteus  simulator  and  Section  6  outlines  the  implementation  of  the 
algorithms  using  Proteus.  Section  7  presents  the  performance  evaluation  of  both  algorithms;  Section  8 
describes  design  alternatives  for  the  bottom-up  algorithm.  Finally,  we  present  our  conclusions  in  Section 
9. 

2  Concurrent  Search  Tree  Algorithms 

The  large  number  of  concurrent  search  tree  algorithms  presented  in  the  literature  prevents  a  complete 
description  of  each  in  this  paper.  Instead,  we  discuss  the  common  issues  that  are  addressed  by  all  of  the 
algorithms.  Much  of  this  discussion  is  based  upon  Wang’s  analysis  of  concurrent  search  tree  algorithms 
[Wan91], 

All  concurrent  search  tree  algorithms  share  the  problem  of  managing  contention.  Concurrency  control 
is  required  to  ensure  that  two  or  more  independent  processes  accessing  a  B-tree  do  not  interfere.  A 
common  approach  is  to  associate  a  read/ write  lock  with  every  node  in  the  search  tree  [LSS87].  This 
causes  data  contention  as  writers  block  incoming  writers  and  readers,  and  readers  block  incoming  writers. 
The  contention  is  severe  when  it  occurs  at  higher  levels  in  the  search  tree,  particularly  at  the  root,  which 
is  often  termed  a  root  bottleneck. 

Similar  problems  are  caused  by  resource  contention.  In  a  shared-memory  architecture  all  of  the 
processes  trying  to  access  the  same  tree  node  will  access  the  same  memory  module  on  the  machine. 
Similarly,  for  message-passing  architectures,  the  processor  on  which  a  node  resides  will  receive  messages 
from  every  processor  trying  to  access  the  node.  Resource  contention  is  again  most  serious  for  the  higher 
levels  in  the  search  tree.  Node  replication  [Wan91]  reduces  contention  but  requires  a  coherence  protocol 
to  maintain  consistency. 
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Associated  with  the  contention  issue  is  the  problem  of  process  overtaking.  This  may  occur  when  a 
process  that  holds  a  lock  selects  the  next  node  it  wishes  to  access,  releases  its  lock  and  attempts  to  acquire 
a  lock  for  the  next  node.  A  second  process  may  acquire  a  lock  on  the  next  node  between  the  original 
process  releasing  the  old  lock  and  acquiring  the  lock  for  the  next  node.  The  second  process  can  then 
update  the  node  in  such  a  fashion  as  to  cause  the  first  process  to  lock  the  wrong  node  when  it  eventually 
acquires  the  lock.  To  prevent  this  kind  of  process  overtaking  many  algorithms  have  their  operations 
use  lock  coupling  to  block  independent  operations.  An  operation  traverses  the  tree  by  obtaining  the 
appropriate  lock  on  the  child  before  releasing  the  lock  it  holds  on  the  parent.  This  technique  is  used 
in  the  top-down  algorithms  [GS78,  CT84,  MR85,  CS90].  The  link  method  algorithms  eliminate  the 
need  for  lock  coupling.  If  the  wrong  node  is  reached  at  any  stage  the  side  links  are  traversed  until  the 
correct  node  is  found.  This  reduces  the  number  of  locks  that  must  be  held  concurrently  and  increases 
throughput.  However,  traversing  the  links  could  in  theory  lead  to  an  increase  in  the  response  time. 

Linear  arrays  of  processors  provide  a  processor-efficient  means  of  implementing  search  tree  algorithms. 
Since  each  level  of  the  search  tree  is  stored  in  the  local  memory  of  a  single  processor,  the  contention 
for  resources  is  approximately  uniform  throughout  the  structure.  Our  algorithm  uses  this  structure  and 
is  very  simple.  The  code  that  implements  a  level  of  the  structure  is  replicated  on  all  the  processors 
in  the  array,  with  some  minor  modifications  for  the  processor  storing  the  leaf  nodes.  Side  links  avoid 
the  need  for  lock  coupling,  which  results  in  higher  concurrency.  The  bottom-up  algorithm  presented  in 
this  paper  leads  to  significant  improvements  in  throughput  over  the  top-down  algorithms.  In  addition, 
we  show  that  the  response  time  for  the  new  algorithm  is  also  less  than  that  for  top-down  algorithms. 
Our  results  show  that  the  bottom-up  algorithm  described  here  has  the  best  performance  of  any  of  the 
implementat''  ns  of  search  trees  for  linear  arrays  described  to  date. 


3  The  Top-Down  Algorithm 

The  top-down  algorithm  allows  insertions,  deletions  and  exact-match  searches  on  a  ( 2B~2,2B )  search 
tree.  The  search  operation  is  a  simple  version  of  the  normal  B-tree  search  operation  [Corr.79].  The  insert 
and  delete  operations  are  based  upon  the  top-down  node-splitting  scheme  introduced  by  Guibas  and 
Sedgewick  [GS78],  in  which  transformations  are  applied  during  a  single  traversal  of  the  tree.  The  tree  is 
traversed  from  the  root  downward  and  transformations  are  applied  between  adjacent  levels  at  the  same 
time. 

The  insert  operation  performs  node  splitting  upon  encountering  a  2B-branch  node,  other  than  the 
root.  A  new  right  brother  of  the  node  is  created  and  the  branches  of  the  original  node  are  divided, 
forming  two  2B~ ‘-branch  nodes.  Figure  2  shows  an  example  of  this  transformation  applied  to  a  (2,8) 
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tree.  This  transformation  ensures  that  any  future  node  splitting  does  not  cause  upward  propagation  in 
the  tree  structure;  thi3  allows  the  transformation  to  be  applied  in  a  top-down  fashion.  This  follows  by 
induction  on  the  depth  of  the  tree,  since  any  transformation  •«  applied  to  the  parent  of  a  node  before 
being  applied  to  the  node  itself.  Therefore,  the  parent  must  have  a  degree  of  at  most  2s  —  1  when  a 
transformation  is  applied  to  the  node. 

When  a  delete  operation  encounters  a  2B-2-branch  node,  other  than  the  root,  one  of  two  deletion 
transformations  is  applied.  If  a  neighboring  node  has  less  than  or  equal  to  2B_1  branches,  the  node  and 
its  neighbo  •  are  merged  to  form  a  single  node.  Otherwise  the  branches  of  the  node  and  its  neighbor  are 
redistributed  evenly  between  the  two  nodes.  Figures  3  and  4  show  examples  of  these  transformations 
applied  to  a  (2, 8)  tree.  The  neighbor  relationship  used  in  the  deletion  algorithm  relates  a  node  to  its  right 
brother  in  the  subtree,  or  in  the  case  of  the  rightmost  node,  to  its  left  brother.  These  transformations 
ensure  that  any  future  merging  of  nodes  will  not  cause  upward  propagation  of  transformations. 

Several  B-tree  algorithms  perform  delete  restructuring  only  when  a  node  is  empty  [Wan91],  These 
strategies  reduce  the  probability  that  the  nodes  need  to  be  merged,  thus  reducing  the  amount  of  work 
required.  However,  merging  nodes  when  they  are  one  quarter  full  preserves  efficient  space  utilization, 
and  bounds  the  height  of  the  tree  and  therefore  the  number  of  processors  required.  This  is  particularly 
important  for  linear  array  implementations,  where  efficient  space  utilization  is  required  and  the  number 
of  available  processors  may  be  limited. 

When  the  transformations  are  applied  to  a  root  node,  the  insertion  transformation  converts  a  root 
node  with  2B  branches  into  a  double  2B_1-branch  node  configuration  and  a  new  root  node,  increasing 
the  height  of  the  tree.  The  merging  deletion  transformation  converts  a  root  node  with  2  branches  into 
a  new  root  node  formed  by  the  merging  of  the  root’s  children  thus  reducing  the  height  of  the  tree.  The 
redistribution  deletion  transformation  does  not  lead  to  a  reduction  in  the  height  of  the  tree. 

A  further  property  of  the  (2B-2,2B)  tree  is  worth  noting  at  this  point.  The  insertion  and  deletion 
transformations  of  some  B-tree  algorithms  [CT84,  Com79j  are  direct  inverses.  In  these  algorithms  an 
application  of  a  transformation  immediately  followed  by  its  inverse  can  occur  and  this  can  result  in 
oscillation.  For  example,  in  the  2-3-4  tree  algorithm  described  by  Guibas  and  Sedgewick  [GS78],  an 
insertion  transformation  splits  a  4-branch  node  into  two  2-branch  nodes.  If  a  deletion  operation  is 
then  applied  to  the  original  node,  the  two  2-branch  nodes  are  merged  reforming  the  original  4-branch 
node.  This  allows  one  transformation  per  operation.  This  problem  is  most  severe  in  higher  levels  of 
the  search  tree  where  updates  are  less  frequent.  Performing  these  transformations  increases  the  average 
response  time  and  reduces  the  throughput  of  the  system.  A  stabilizing  (hysteresis)  effect  occurs  in  the 
(2b~2,2b)  tree  since  the  insertion  transformation  leaves  each  node  (except  the  root)  with  2B_1  children. 
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Before  a  deletion  transformation  may  be  applied  to  the  node,  2B~ 2  children  must  be  removed  from  it.  2 
Therefore,  the  oscillations  encountered  in  other  algorithms  cannot  occur.  The  root,  node  exhibits  similar 
hysteresis  behavior.  When  the  tree  grows  a  new  root  node  is  created  with  two  children.  Each  of  these 
nodes  has  2B~l  branches.  Before  the  tree  can  shrink,  2s-2  children  must  be  removed  from  one  of  these 
nodes. 

The  implementation  of  the  top-down  algorithm  on  a  reconfigurable  system  of  transputers  [Inm86]  is 
described  in  Colbrook  and  Smythe  [CS90].  It  is  worth  noting  that  the  Proteus  simulator  gave  comparable 
results  to  those  reported  by  Colbrook  and  Smythe. 

4  The  Bottom-Up  Algorithm 

In  the  bottom-up  algorithm  both  left  and  right  links  are  added  between  adjacent  siblings  in  a  (2B~2, 2B) 
search  tree.  These  links  provide  an  additional  method  of  reaching  a  node.  The  intent  of  this  scheme 
is  to  make  all  nodes  in  a  single  level  reachable  from  any  other  node  at  that  level.  If  the  wrong  node 
is  selected  at  the  level  above  then  the  correct  node  can  be  found  by  using  the  links.  This  allows  for 
an  efficient  solution  to  the  process  overtaking  problem  and  permits  changes  to  the  tree  structure  to  be 
made  by  a  background  task. 

The  algorithm  again  allows  insertions,  deletions  and  exact-match  searches  i  n  a  (2B~2,2B)  search 
tree.  Each  of  these  operations  begins  by  calling  a  find  operation,  which  traverses  the  tree  from  the  root 
until  it  reaches  the  leaf  node  that  may  store  the  specified  key.  For  a  search  operation  tne  keys  stored  at 
the  leaf  node  are  then  searched  for  the  specified  key  and  the  result  Ls  sent  to  the  inquiring  process. 

An  insert  operation  attempts  to  add  the  specified  key  to  the  keys  stored  at  the  leaf.  If  the  leaf 
already  stores  2B  keys  and  the  specified  key  is  not  one  of  these,  then  an  insertion  transformation  is 
applied  causing  the  leaf  node  to  be  split  into  two  2B_1-key  nodes.  The  new  node  becomes  the  right 
brother  of  the  original  leaf  node.  The  specified  key  is  the  added  to  the  keys  stored  at  the  appropriate 
node  and  the  insert  operation  returns.  The  work  required  to  propagate  the  split  to  higher  levels  is 
carried  out  as  a  background  task.  This  task  adds  a  pointer  to  the  newly  created  node  at  the  appropriate 
node  in  the  next  level.  This  in  turn  may  cause  an  insertion  transformation:  the  split  is  propagated  until 
a  level  is  reached  where  no  split  occurs.  Should  the  root  of  the  tree  be  split  then  a  new  root  is  created 
and  the  height  of  the  tree  increases  by  one.  Figure  5  shows  an  example  of  the  insertion  transformation 
applied  to  a  (2,8)  tree. 

A  delete  operation  attempts  to  remove  the  specified  key  from  the  keys  stored  at  the  leaf.  If  the  leaf 
stores  2l,~~  keys,  one  of  which  is  the  specified  key,  then  one  of  two  transformations  is  applied  to  the  node 


2 Since  B  >  3  =>  2D“J  >  2. 
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and  its  right  neighbor  in  the  tree  (or  the  left  neighbor  for  the  rightmost  leaf).  These  transformations 
are  identical  to  those  used  for  the  top-down  algorithm.  The  delete  operation  then  returns  am  *he 
propagation  of  the  transformation  to  higher  levels  is  again  carried  out  as  a  background  task.  In  this 
case  there  is  no  guarantee  that  the  node  and  its  neighbor  share  the  same  parent  at  the  higher  level,  so 
the  transformation  may  cause  the  index  values  associated  with  the  nodes  at  a  higher  level  to  change. 
Transformations  and  changes  to  the  index  values  are  propagated  until  no  further  changes  are  required 
at  a  higher  level.  Should  the  nodes  at  the  level  below  the  root  be  merged  to  form  a  single  node  then 
this  node  becomes  the  root  and  the  height  of  the  tree  decreases.  Figures  6  and  7  show  examples  of  these 
transformations  applied  to  a  (2, 8)  tree. 

Since  the  changes  to  non-leaf  levels  caused  by  a  transformation  are  carried  out  as  a  background 
task,  the  find  operation  must  guarantee  that  t'  :  correct  leaf  node  is  reached  even  if  the  tree  traversed  is 
made  between  a  transformation  at  a  leaf  node  and  the  completion  of  the  subsequent  transformations  and 
changes  at  higher  levels.  To  achieve  this  we  introduce  a  notion  of  covers  for  each  node.  An  index  node 
has  associated  with  it  a  value  i,  the  minimum  key  value  that  may  be  stored  in  the  subtree  rooted  at  the 
node.  A  key  k  is  covered  by  a  node  xif  and  only  if  x’s  index  label  and  the  index  label  of  x’s  right  neighbor, 
y,  indicate  that  the  leaf  that  may  store  k  is  a  descendant  of  x.  That  is,  covers(k,x)  <=>  x.i  <  k  <  y.i. 
When  x  is  the  rightmost  node  at  a  given  level,  covers(k,  x )  x.i  <  k.  When  a  find  operation  encounters 
a  non-leaf  node  that  does  not  cover  the  specified  key,  then  the  level  is  traversed  from  the  node  to  either 
the  left  (if  x.i  >  k)  or  to  the  right  (if  y.i  <  until  the  node  that  covers  k  is  found. 

In  the  case  of  a  deletion  transformation  that  causes  two  nodes  to  be  merged,  the  neighbor  is  not 
removed  immediately  but  is  flagged  as  deleted  and  the  links  pointing  to  it  from  other  nodes  at  the  same 
level  are  updated.  This  allows  references  to  a  deleted  node  to  be  made  by  find  until  the  result  of  the 
transformation  has  propagated  to  the  higher  level.  When  the  find  operation  encounters  a  deleted  node 
the  traversal  immediately  begins  at  the  left  brother  of  the  deleted  node. 


5  The  Simulator 

The  Proteus  system  simulates  the  events  that  take  place  in  a  parallel  machine  at  the  level  of  individual 
machine  instructions.  The  user  writes  a  parallel  program  using  a  simple  superset  of  the  C  programming 
language  and  a  set  of  supported  simulator  calls.  The  parts  ol  the  user  program  executed  locally  on 
each  processor  are  written  in  standard  C  and  translated  by  the  C  compiler  into  machine  code  for  the 
computer  running  the  simulation.  All  nonlocal  interactions,  such  as  message  passing,  are  performed  by 
the  supported  simulat  <r  calls,  which  correspond  to  the  machine  code  instructions  that  perform  nonlocal 
interactions  in  real  parallel  machines. 
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We  simulate  a  multiprocessor  configured  as  a  bidirectional  ring.  Each  node  consists  of  a  processor, 
local  memory,  and  a  network  chip  for  routing  messagi  s  without  using  processor  cycles.  The  network  is 
packet  switched  and  uses  wormhole  routing.  We  assume  that  a  message  fus  in  one  packet.  Wormhole 
routing  (as  opposed  to  store-and-forward)  is  relevant  only  for  messages  that  travel  more  than  one  hop. 
We  would  expect  a  store-and-forward  network  to  produce  similar  throughput  results,  but  have  worse 
response  times.  The  latter  results  from  the  intermediate-node  delays  on  messages  to  and  from  the 
processor  used  to  send  queries  to  the  root  node  and  receive  replies  from  leaf  and  root  nodes. 

The  one-way  delay  for  a  message  is  the  sum  of  several  values:  the  tin  i  to  put  a  message  on  the 
network,  the  delay  across  the  network,  and  the  delay  at  the  target.  We  assume  a  minimum  wire  delay 
of  one  cycle  per  (4-byte)  word  and  a  minimum  switch  delay  of  one  cycle  per  word.  Without,  contention 
a  five-word  message  (the  typical  message  size  for  both  algorithms)  requires  seven  cycles  to  go  one  hop: 
one  cycle  each  for  the  source  processor,  the  wire,  and  the  target  processor  for  the  first  word  and  an 
additional  cycle  for  each  of  the  subsequent  words.  We  discuss  the  effect  of  longer  message  delays  in 
Section  7. 

6  Implementation 

The  top-down  and  bot*om-up  algorithms  (tave  been  implemented  on  Proteus.  A  designated  processor, 
termed  the  Server,  is  used  to  send  queries  to  and  receive  replies  from  the  processor  storing  the  root  node. 
The  Server  also  receives  the  replies  generated  by  the  queries  from  the  processor  storing  the  leaf  nodes. 

In  this  section  the  implementation  of  the  insert ,  delete  and  search  operations  for  each  of  the  algorithms 
is  described.  For  a  tree  of  H  levels,  processor  Ph  (where  1  <  h  <  H)  is  an  index  processor  and  processor 
PL  is  the  leaf  processor,  as  shown  in  Figure  1  Processor  Ph  (where  1  <  h  <  H)  communicates  only  with 
Ph- i  and  P/.+  i .  Processor  Ph  communicates  with  Ph-\  and  the  Server.  Processor  P\  communicates 
with  P>  and  the  Server. 

6.1  The  Top-Down  Algorithm 

When  processor  Ph  receives  a  search(k,x)  message  (search  for  key  k  using  node  x )  it  selects  y,  the 
appropriate  child  of  z,  and  sends  the  message  search(k,y)  to  processo'  Ph-i  When  processor  P\  receives 
a  search(k,x)  message  it  searcues  the  keys  stored  in  node  x  for  the  key  k  and  returns  the  result  to  the 
Server  processor. 

During  the  insert  operation  processor  Ph  receives  an  insciiJransform(k,x)  (transform  node  z,  in¬ 
serting  key  k)  message.  A  transformation  is  applied  if  z  is  a  2a-branch  node.  In  this  case  a  new  node 
r’  with  splitting  key  k’  is  created,  and  the  message  inserLreply(k’,x’)  is  sent  to  processor  P/,+  i. 


Oth- 
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erwise,  the  message  insert-reply(k,nil)  is  sent  to  processor  Ph+u  where  nil  is  a  dummy  node  value  to 
indicate  that  no  transformation  occurred.  Processor  Ph  then  selects  the  appropriate  child  y  and  the 
message  %nsert-transform(k,y)  is  sent  to  processor  Ph-i-  Processor  Ph  then  waits  until  it  receives  the 
insert-reply  (k’’,y’)  message  from  Ph-i  whereupon  if  y’  is  not  nil  it  adds  the  new  child  and  splitting  key 
to  the  node  x  (or  to  x’  if  appropriate).  When  processor  Pi  receives  an  inseri.transform(k,x)  message  it 
determines  whether  a  transformation  should  be  applied  and  proceeds  in  the  manner  of  processor  Pj,.  If 
the  key  is  not  already  present  it  is  inserted  into  the  appropriate  leaf  node.  When  the  Server  processor 
receives  the  insert-reply (k’,y’)  messages  from  processor  Ph  and  y  is  not  nil,  a  new  thread  is  started 
on  processor  P//+i,  which  becomes  the  new  root  processor.  Figure  2  shows  an  example  of  an  tnser: 
operation  applied  to  a  (2,8)  tree. 

The  delete  operation  proceeds  in  the  style  of  insert  with  the  appropriate  transformation  and  replies 
for  updating  index  information  being  applied  at  every  stage.  However,  processor  Ph-i  controls  shrinking 
(as  opposed  to  Ph  for  growing).  Therefore,  in  the  case  where  a  delete-transform  message  is  sent  to  a 
2-branch  root  node  at  Ph  the  reply  to  the  Server  is  delayed  until  Ph~i  completes  communication  with 

Ph- 2- 

Examples  of  delete  operations  with  search  key  50  applied  to  the  nodes  of  a  (2,8)  tree  are  shown 
in  Figures  3  and  4.  Note  that  the  node  address  included  in  the  delete-reply  message  indicates  the 
transformation  that  has  been  applied.  If  the  address  of  the  transformed  node  ( X  in  this  case)  is  included 
in  the  reply  this  indicates  that  a  merging  transformation  was  applied.  A  redistribution  occurred  if  the 
address  of  the  neighboring  node  is  included. 

Thus,  the  top-down  algorithm  is  a  lock-coupling  algorithm;  processor  Ph  is  locked  while  a  transfor¬ 
mation  is  applied  at  processor  Ph- 1.  The  reply  message  causes  the  lock  on  Ph  to  be  released  and  is 
mandatory  following  the  sending  of  a  transform  message  to  Ph- 1  even  if  no  actual  transformation  oc¬ 
curs.  For  n  update  operations  on  a  tree  with  approximately  n  entries,  the  top-down  algorithm  requires 
approximately  n  logfl  n  downward  messages  and  the  same  number  of  upward  messages. 

6.2  The  Bottom-Up  Algorithm 

In  the  implementation  of  the  bottom-up  algorithm,  two  forms  of  the  fmd(k,x)  message  are  used  to 
distinguish  between  the  implementation  of  a  search  operation  ( findsearch )  and  the  implementations  of 
the  insert  and  delete  operations  (find-update).  When  processor  Ph  receives  a  findsearch(k,x,f)  message, 
meaning  it  should  search  for  key  k  from  node  x  and  parent  /,  it  executes  x':=find-node(k,x)  with  the 
following  code  : 


6.2  The  Bottom-Up  Algorithm 
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Stage  1:  The  processor  storing  W  sends  an  “inserLtransform  at 
node  X  with  key  50”  message  to  the  processor  storing  X. 


Stage  2:  X  is  split  to  form  node  X’  and  the  processor  storing  X 
sends  an  insert.reply(69,X’)  message  to  the  processor  storing  W. 


Stage  3:  The  processor  storing  W  adds  the  new  splitting  key,  69, 
and  the  address  of  X  ’  to  W. 

Figure  2:  An  insert  operation  applied  to  a  (2,8)  tree  using  the  top-down  algorithm.  The  shaded  nodes 
represent  those  nodes  that  are  unchanged  during  a  particular  stage  in  the  transformation  process. 
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Stage  1:  The  processor  storing  W  sends  an  “delete -transform  at 
node  X  using  neighbor  Y  with  key  50”  message  to  the  processor 
storing  X. 


Stage  2:  X  and  Y are  merged  and  Y  is  deleted.  The  processor  stor¬ 
ing  X  sends  an  delete.reply(45,X)  message  to  the  processor  storing 


W. 


Stage  3:  The  processor  storing  W  removes  the  entry  for  V  from 
W. 


Figure  3:  A  delete  operation  applied  to  a  (2,8)  tree  using  the  top-down  algorithm  and  causing  merging. 


6.2  The  Bottom-Up  Algorithm 
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Stage  1:  The  processor  storing  W  sends  an  “delete  Jransform  at 
node  X  using  neighbor  Y  with  key  50”  message  to  the  processor 
storing  X. 


Stage  2:  The  children  of  X  and  Y  are  redistributed.  The  proces¬ 
sor  storing  X  sends  an  delete_reply( 77,  Y)  message  to  the  processor 
storing  W. 


Stage  3:  The  processor  storing  W  changes  the  key  value  associated 
with  Y. 

Figure  4:  A  delete  operation  applied  to  a  (2,8)  tree  using  the  top-down  algorithm  and  causing  redistri¬ 
bution. 
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lind.node  =  procedure (k: key ;  x:node)  returns  (node) 
if  (x. deleted)  then 

if  (x.left  =  nil)  then  return  lind_node(k,x. right) ; 
else  return  find_node(k, x.left) ; 
else  if  (x.i  >  k)  then  return  find_node(k, x.left) ; 
else  if  (x. right  =  nil)  then  return(x); 

else  if  (x. right,  i  <=  k)  then  return  f  5~  ■’._node(k,x. right ) ; 
else  return(x); 
end  find_node 

where  x. right  and  x.left  are  the  right  and  left  brothers  of  x  (pointed  to  by  the  links  at  x),  and  x.  deleted 
has  the  value  true  when  x  has  been  deleted.  find.node(x,k)  returns  the  node  at  the  same  level  as  x  from 
which  the  key  k  is  covered.  The  node  splitting  associated  with  the  insertion  transformation  may  cause 
traversal  along  the  right  link  and  the  node  redistribution  associated  with  the  deletion  transformation 
may  cause  traversal  along  the  left  link. 

The  additional  parameter  /  of  find-search  is  the  address  of  the  parent  of  x.  Each  search  tree  node 
(other  than  the  root)  maintains  the  address  of  its  parent  so  as  to  direct  the  propagation  of  transformations 
to  higher  levels  in  the  search  tree.  When  a  find-search  message  is  received,  the  parent  pointer  of  x  is 
assigned  the  value  of  /.  This  propagates  the  changes  made  to  parents  to  their  children. 

Processor  Pfc  then  selects  y,  the  appropriate  child  of  x\  and  sends  the  message  fin<Lsearch(k,y,x’)  to 
P/i-i-  When  processor  P\  receives  a  findsearch  (k,x,f)  message  it  executes  x’=find-node(k,x),  searches 
the  keys  stored  in  node  x’  for  the  key  k,  and  returns  the  result  to  the  Server  processor. 

During  an  insert  or  delete  operation  processor  P/,  receives  a  find-iipdate(k,x,f)  message.  The  routine 
find-node  is  again  called  and  the  appropriate  child  y  of  x’  is  selected.  Processor  P/,  then  sends  the 
message  find-update(k,y,x’)  to  processor  Ph-i-  When  processor  Pi  receives  a  find-update(k,x,f)  message 
the  parent  pointer  is  updated  as  before  and  x’=find.node(k,x)  is  executed.  A  transformation  is  then 
applied  to  x’if  required  and  the  key  k  is  either  added  to  (for  insert)  or  deleted  from  (for  delete )  the 
appropriate  node.  A  message  indicating  completion  is  then  sent  to  the  Server  processor. 

If  an  insertion  transformation  occurred  at  the  leaf  level,  processor  Pi  sends  an  insert  Jransform(k’,x’,f) 
message  (insert  the  new  node  x' with  splitting  key  it' and  parent  /)  to  processor  P2.  Processor  P2  calls 
find.node(f,k’)  to  determine  the  node  to  which  x’  and  k’  should  be  added.  If  a  transformation  occurs 
as  a  result  of  this  addition  then  another  insertJransform  message  is  sent  to  processor  P3.  This  pro¬ 
cess  continues  until  a  level  is  reached  where  no  transformation  occurs.  Should  the  Server  receive  an 
insertJransform  message  from  the  root  processor  then  a  new  thread  is  started  on  processor  Ptf+i  and 
Ph+i  becomes  the  new  root  processor.  An  example  of  an  insertJransform  message  applied  to  the  nodes 
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of  a  (2, 8)  tree  is  shown  in  Figure  5.  When  the  search  tree  is  in  Stage  2  of  Figure  5,  a  find  or  find.update 
operation  sent  to  X  from  W  with  a  key  value  greater  than  or  equal  to  69  results  in  Z  being  selected  by 
find-node. 

The  processing  of  the  transformations  for  the  delete  operation  proceeds  in  the  style  of  insert  with 
the  appropriate  transformation  for  updating  index  information  being  applied  at  every  stage.  Examples 
of  delete-transform  message  applied  to  the  nodes  of  a  (2, 8)  tree  are  shown  in  Figures  6  and  7.  Note  that 
the  number  of  key  values  included  in  the  delete-transform  message  indicates  the  transformation  that  was 
applied  at  the  sender.  If  only  a  single  key  is  included  then  the  entry  for  the  child  with  this  key  should 
be  removed  at  the  higher  level.  If  two  key  values  are  included  then  the  entry  for  the  child  with  the  first 
value  should  be  changed  to  the  second  key  value.  The  use  of  find-node  to  determine  the  correct  node  to 
update  allows  the  maintenance  of  the  parent  pointers  to  be  carried  out  in  this  lazy  style. 

For  n  update  operations  the  bottom-up  algorithm  requires  approximately  n  logB  n  downward  mes¬ 
sages.  However,  upward  messages  only  occur  following  a  transformation.  In  practice  this  results  in 
significantly  fewer  upward  messages  than  in  the  top-down  case  as  shown  by  the  experimental  results 
reported  in  Table  1  in  the  next  section. 


7  Relative  Performance 

This  section  compares  the  relative  performance  of  the  two  algorithms  using  the  results  of  simulations. 
Three  sets  of  simulations  measured  the  throughputs  and  response  times  of  each  algorithm  for  different 
values  of  the  branching  factor  constant  B  and  various  query  mixes.  A  fourth  set  of  simulations  then 
measured  the  relative  performance  of  the  algorithms  under  changes  to  the  one-way  message  delay.  Finally, 
we  compared  the  performance  of  the  algorithms  using  synchronous  and  asynchronous  message  passing 
between  the  processors. 

We  conducted  a  number  of  sets  of  simulations  for  each  of  the  algorithms  using  a  range  of  values  for 
the  branching  constant  B  in  each  case;  three  sets  of  results  are  presented  in  this  paper.  B  was  varied 
between  2  (this  was  not  strictly  a  (2B~7,2B)  search  tree  and  actually  represented  the  2-3-4  tree  used 
by  Carey  and  Thompson  [CT84])  and  8  (a  64-256  tree).  The  size  of  the  processor  array  was  varied  for 
each  case  so  that  only  the  number  of  processors  required  was  used.  The  throughput  was  measured  in 
terms  of  the  average  number  of  tree  operations  that  complete  for  every  one  hundred  thousand  machine 
cycles  and  the  response  time  was  measured  in  terms  of  the  average  number  of  machine  cycles  between 
the  Server  sending  a  query  and  receiving  the  corresponding  reply. 

'The  first  set  of  simulations,  Test  1,  measured  the  performance  of  the  algorithms  when  50,000  tnsert 
operations  using  random  key  values  were  applied  to  an  initially  empty  tree.  For  the  second  and  third 
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The  value  of  W.i 


Stage  1:  The  processor  storing  X  receives  an  insert-transform  mes¬ 
sage  informing  it  that  a  new  node  Z  has  been  generated  at  the  level 
below. 


Stage  2:  findjnode(75,X)  is  invoked,  which  returns  node  X  in  this 
case.  X  is  split  to  form  node  X’  and  the  horizontal  links  in  X 
and  Y  are  updated  accordingly.  The  processor  storing  X  sends  an 
insert Jransform(69,X\  W)  message  to  the  processor  storing  W. 


Stage  3:  The  processor  storing  W  invokes  find.node(69,  W),  which 
returns  node  W  in  this  case.  The  new  splitting  key,  69 ,  and  the 
address  of  X’  are  added  to  W.  Since  no  further  transformation  is 
required  no  additional  upward  messages  are  generated. 


Figure  5:  An  insertjransform  operation  applied  to  a  (2,8)  tree  using  the  bottom-up  algorithm.  The 
shaded  nodes  represent  those  nodes  that  are  unchanged  during  a  particular  stage  in  the  transformation 
process. 
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Stage  1:  The  processor  storing  X  receives  a  delete Jransform(51  ,X) 
message  informing  it  that  the  entry  for  key  value  51  has  been 
deleted. 


Stage  2:  find.node(75,X)  is  invoked,  which  returns  node  X  in  this 
case.  X  and  Z  are  merged  and  Z  is  marked  as  deleted.  The  hor¬ 
izontal  links  in  X  and  Y  are  updated  accordingly.  The  processor 
storing  X  sends  a  delete Jransform(69,  1TJ  message  to  the  processor 
storing  W. 


Stage  3:  The  processor  storing  IT  invokes  find.node(69,W ),  which 
returns  node  W  in  this  case.  The  entry  for  splitting  key  69  is 
removed  from  W.  Since  no  further  transformation  is  required  no 
additional  upward  messages  are  generated. 

Figure  6:  A  delete  .transform  operation  applied  to  a  (2, 8)  tree  using  the  bottom-up  algorithm  and  causing 
merging. 
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Stage  1:  The  processor  storing  X  receives  a  delete Jransform(51,X) 
message  informing  it  that  the  entry  for  key  value  51  has  been 
deleted. 


Stage  2:  jind-node(51  ,X)  is  invoked,  and  returns  node  X  in  this 
case.  The  children  of  X  and  Z  are  redistributed.  The  processor 
storing  X  sends  a  deleleJransform(69,75,  W)  message  to  the  pro¬ 
cessor  storing  W. 


Stage  3:  The  processor  storing  W  invokes  find.node(69,W),  which 
returns  node  W  in  this  case.  Its  splitting  key  values  are  updated. 
Since  no  further  transformation  is  required  no  additional  upward 
messages  are  generated. 


Figure  7:  A  dele teJransform  operation  applied  to  a  (2,8)  tree  using  the  bottom-up  algorithm  and  causing 
redistribution. 
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sets  of  simulations,  Test  2  and  Test  3 ,  1,000  insert  operations  using  random  keys  were  first  applied  to 
an  empty  search  tree  followed  by  10,000  randomly  selected  operations.  For  Test  2  the  percentages  of 
insert,  delete  and  search  operations  in  this  random  selection  were  50%,  30%  and  20%,  respectively,  and 
for  Test  3  they  were  33%,  33%  and  34%,  respectively.  Each  time  a  delete  operation  was  applied  the  key 
value  closest  to  the  selected  key  value  was  deleted. 


The  throughputs  and  response  times  are  shown  in  Figures  8  and  9.  In  all  cases  the  throughput 
improves  for  increasing  values  of  B  up  to  4  (or  5  for  Test  2  applied  to  the  bottom-up  algorithm).  This 
peaked  response  is  caused  by  a  trade-off  between  the  number  of  transformations  and  the  processing  time 
for  searching  the  key  values  stored  at  a  node.  For  low  values  of  B,  insert  and  delete  operations  cause 
more  transformations  to  the  tree  structure.  Transformations  increase  the  average  processing  time  for  a 
query  and  leads  to  a  reduction  in  throughput.  For  higher  values  of  B  the  time  required  to  search  and 
updone  the  keys  stored  at  a  node  increases.  This  also  increases  the  average  processing  time  for  a  query 
and  leads  to  a  reduction  in  throughput.  Therefore  an  optimal  value  for  B  exists  where  neither  effect 
leads  to  a  significant  degradation  in  throughput. 


The  response  times  reach  a  minimum  value  as  B  increases  and  then  grows  as  B  continues  to  increase. 
The  value  of  B  governs  the  number  of  processors  in  the  ring;  as  B  increases  the  number  of  processors 
decreases  since  a  greater  number  of  key  values  are  stored  in  each  node.  For  increasing  values  of  B  below 
this  minimum,  the  improvement  in  response  time  is  caused  by  the  reduction  in  the  number  of  inter¬ 
processor  hops  required  for  a  single  query.  For  increasing  values  of  B  greater  than  the  minimum,  the 
degradation  in  response  time  occurs  due  to  the  increase  in  the  processing  time  at  a  single  node.  Therefore 
there  is  a  trade-off  between  the  computation  at  a  processor  and  communication  between  processors  to 
achieve  the  optimal  response  time. 


When  the  two  algorithms  are  compared,  the  bottom-up  algorithm  performs  better  in  both  throughput 
and  response  time  for  all  cases.  The  improvement  in  throughput  arises  because  no  lock  coupling  is 
required  and  upward  messages  only  occur  when  a  transformation  is  required.  The  majority  of  upward 
messages  in  the  top-down  case  merely  verify  that  no  transformation  took  place.  The  counts  of  upward 
messages  required  during  Test  2  are  given  in  Table  1.  The  bottom-up  algorithm  requires  significantly 
fewer  upward  messages  in  ail  cases,  leading  to  very  little  contention  between  messages.  The  numbers 
of  downward  messages  required  by  the  two  algorithms  are  approximately  the  same  and  are  equal  to  the 
number  of  upward  messages  in  the  top-down  case. 


Response  Time 
(xlOOO  machine  cycles) 


Figure  9:  The  response  times  for  the  search  trees. 
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Branching  constant 

B 

Top-Down 

algorithm 

Bottom-Up 

algorithm 

2 

65213 

2556 

3 

48805 

884 

4 

37329 

345 

5 

27568 

150 

6 

26143 

71 

7 

19654 

36 

8 

19422 

15 

Table  1:  The  numbers  of  upward  messages  required  during  Test  2 


As  noted  earlier,  the  changes  to  non-leaf  levels  caused  by  transformations  in  the  bottom-up  algorithm 
are  made  by  a  background  task.  The  find  operation  guarantees  that  the  correct  leaf  node  is  reached 
independent  of  whether  all  the  required  changes  have  been  made.  This  may  require  traversal  of  the 
horizontal  links  connecting  adjacent  nodes  leading  to  an  increase  in  processing  time.  The  numbers  of 
times  a  horizontal  link  was  traversed  during  Test  2  are  given  Li  Table  2.  The  number  of  traversals 
increases  as  the  value  of  B  decreases  but  in  all  cases  the  numbers  are  small  when  compared  with  the 
total  number  of  operations,  which  is  11000.  The  fact  that  so  few  horizontal  traversals  are  made  partly 
accounts  for  the  low  response  times  of  the  bottom-up  algorithm. 


Branching  constant 

Horizontal  Traversals 

2 

29 

3 

19 

4 

9 

5 

7 

6 

5 

7 

2 

8 

2 

Table  2:  The  numbers  of  horizontal  traversals  required  during  Test  2 

The  response  time  for  an  individual  query  is  also  better  in  the  bottom-up  algorithm  because  trans¬ 
formations  at  non-leaf  levels  are  conducted  after  the  query  has  terminated.  Transformations  propagate 
up  the  tree  only  as  far  as  necessary,  which  reduces  the  contention  experienced  by  later  queries. 

The  effect  of  variations  in  the  message  delay  was  measured  for  each  algorithm  when  10,000  i nsert 
operations  using  random  key  values  were  applied  to  an  (8,32)  tree  (B=  5)  usmg  synchronous  message 
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passing.  As  the  one-way  message  delay  increased  the  throughputs  of  both  the  top-down  and  bottom-up 
algorithms  decreased  at  approximately  the  same  rate  (although  the  overall  percentage  change  is  smaller 
for  the  bottom-up  algorithm).  The  response  time  for  both  algorithms  increased  as  the  one-way  message 
delay  increased.  However,  the  increase  for  the  bottom-up  algorithm  was  less  signific  int  than  that  for  the 
top-down  algorithm.  This  difference  js  caused  by  the  lock  coupling  and  the  greater  number  of  upward 
messages  in  the  top-down  case. 

Experiments  were  conducted  using  both  synchronous  and  asynchronous  message  passing  between 
processors.  The  synchronous  message  style  is  similar  to  that  used  in  transputer  systems  [Inm86].  The 
synchronous  messages  have  the  following  protocol.  The  sender  sends  a  message  to  the  target  and  waits 
for  an  acknowledgment.  The  message  causes  an  interrupt  at  the  target.  The  target  processor  must 
independently  issue  a  receive  command.  When  a  receive  command  is  issued,  if  a  message  is  already 
present,  the  receive  routine  sends  an  acknowledgment  to  the  sender  and  returns  the  address  of  the 
message.  If  no  message  has  been  received,  the  receive  routine  waits  for  the  interrupt  and  then  handles 
the  message. 

For  asynchronous  messages  the  sender  sends  a  message  to  the  target  and  does  not  wait  for  an 
acknowledgment.  The  message  causes  an  interrupt  at  the  target  and  is  stored  in  a  buffer  of  waiting 
messages.  The  messages  in  this  buffer  are  serviced  in  a  FIFO  order.  When  the  target  processor  issues  a 
receive  command  a  waiting  message  is  removed  from  the  buffer.  If  no  messages  have  bee"  received  the 
receive  routine  waits  for  the  interrupt.  Using  asynchronous  message  as  opposed  to  synchronous  messages 
for  the  bottom-up  algorithm  leads  to  improvements  in  both  throughput  and  response  time  as  the  need  for 
synchronization  between  adjacent  processors  is  removed  and  the  number  of  messages  between  processors 
decreases.  However,  the  top-down  algorithm  is  implicitly  synchronous  since  the  parent  processor  always 
waits  for  the  reply  from  its  son.  Thus  moving  from  synchronous  to  asynchronous  message  passing  does 
not  affect  performance  in  this  case. 


8  Design  Alternatives 

In  the  H-tree  algorithms  described  in  [WW90,  Wan91,  LY81J  the  rightmost  node  examined  at  each  level 
is  pushed  onto  a  stack  during  the  downward  traversal  of  the  search  tree.  This  stack  is  included  in 
the  find-update  message.  If  a  transformation  occurs,  the  contents  of  the  stack  are  used  to  provide  an 
mdication  of  the  node  to  be  updated  at  a  given  level  in  the  tree.  This  is  only  am  indication  as  subsequent 
transformations  may  have  occurred  between  the  original  downward  traversal  and  the  propagation  of 
transformations  The  message  length  for  update  operations  and  transformations  is  O(logn),  since  the 
stack  size,  and  hence  the  message  size,  is  proportional  to  the  height  of  the  tree. 
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The  alternative  technique  used  in  the  algorithm  described  here  maintains  a  pointer  to  the  parent 
node  at  each  node  (other  than  the  root)  in  the  search  tree.  These  pointers  are  used  to  give  a  similar 
indication  of  the  node  to  be  updated  following  a  transformation.  A  question  arises  as  to  how  the  pointers 
should  be  maintained  as  transformations  at  level  A  cause  inconsistencies  in  some  of  the  parent  pointers  in 
level  A-/.  We  investigated  several  solutions  and  compared  their  performance  to  that  of  the  stack-based 
method. 

The  approach  described  in  Section  5.2  includes  the  address  of  the  parent  node  in  every  find  message 
sent  to  a  child  regardless  of  whether  a  change  is  required.  We  have  termed  this  a  conservative  pointer- 
based  approach.  The  address  is  then  assigned  to  the  parent  pointer  of  the  child.  This  leads  to  a 
shorter  average  message  length  than  the  stack-based  method  (0(1)  compared  to  O(logn)),  and  proved 
to  be  simpler  to  implement.  In  addition,  since  the  original  downward  traversal  will  have  updated  the 
parent  pointers  of  all  the  nodes  accessed,  the  indication  given  by  the  pointer-based  approach  will  always 
be  at  least  as  good  as  that  given  by  the  stack-based  method.  In  cases  where  changes  to  the  parent 
pointers  occur  between  the  original  downward  traversal  and  the  propagation  of  the  transformations,  the 
indication  given  by  the  pointer-based  approach  is  better  than  that  given  by  the  stack-based  method. 
Although  the  differences  in  performance  are  marginal,  for  long  processor  arrays,  where  the  messages  in 
the  stack-based  method  are  significantly  longer,  the  pointer-based  approach  has  superior  performance. 
In  systems  where  messages  are  divided  into  small  packets  before  transmission  across  the  network,  the 
address-based  approach  may  lead  to  fewer  packets,  therefore  reducing  network  latency  and  contention 
and  improving  performance. 

Two  alternative  techniques  for  maintaining  the  parent  pointers  were  investigated.  Maintaining  a  list 
of  the  inconsistent  children  at  level  A  and  notifying  these  nodes  of  the  change  to  their  parent  pointer  when 
they  are  next  accessed  proves  to  be  a  costly  solution.  Checks  have  to  be  made  before  sending  a  message 
and  after  receiving  a  message  to  determine  whether  changes  are  required.  Alternatively,  a  new  message 
type  may  be  introduced  that  is  sent  by  processor  P\  to  processor  Ph-i  following  a  transformation  at 
Ph.  The  message  simply  notifies  P^- i  of  the  required  changes  to  the  parent  pointers.  Although  this 
approach  has  shorter  messages  than  the  conservative  pointer-based  approach,  it  leads  to  an  increase  in 
the  total  number  of  messages.  Both  these  alternatives  exhibit  poorer  performance  than  the  stack-based 
and  the  conservative  pointer-based  approaches. 

Several  different  algorithms  can  be  used  during  the  execution  of  the  find  operation  to  ensure  the 
required  leaf  node  is  reached  in  the  bottom-up  algorithm.  The  small  number  of  horizontal  traversals 
given  in  Table  2  for  the  bottom-up  algorithm  suggests  that  choosing  the  wrong  node  during  a  downward 
traversal  is  a  very  infrequent  event.  A  possible  optimization  of  the  algorithm  is  to  assume  that  the 
correct  node  is  always  chosen  and  to  only  execute  the  finiLnode  routine  given  in  Section  6.2  at  the  leaf 
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level.  If  the  wrong  node  is  chosen  at  a  branch  level  then  the  downward  traversed  will  descend  to  either 
the  leftmost  or  rightmost  node  in  the  subtree  of  which  the  chosen  node  is  the  root  -  leftmost  if  the  chosen 
node  has  a  key  value  greater  than  the  search  key  and  rightmost  if  the  chosen  node  has  a  key  value  less 
than  the  search  key.  This  gives  about  a  10%  improvement  in  response  time  over  the  findjnode  algorithm 
given  in  Section  6.2.  The  throughputs  are  approximately  the  same;  the  execution  of  find-node  at  the 
leaf  level  prevents  any  improvement.  However,  an  optimization  can  be  made  so  that  find-node  is  called 
at  every  level  only  when  the  wrong  node  may  have  been  selected.  If  the  leftmost  or  rightmost  child  is 
selected  as  the  next  node  then  find-node  is  called  and  any  required  horizontal  traversals  are  made.  This 
leads  to  similar  improvements  in  response  time  over  the  algorithm  given  in  Section  6.2. 


9  Conclusions 

We  have  shown  that  a  (2B~2,2B)  search  tree  of  n  entries  can  be  implemented  on  a  linear  array  of  up  to 
(log2  n/(B  —  2)]  +  1  processors,  where  each  processor  stores  a  level  of  the  tree  structure.  Such  a  linear 
array  may  be  physically  mapped  onto  processors  in  two  or  three  dimensions  on  the  majority  of  available 
architectures.  Updates  can  be  performed  on  the  tree  using  both  top-down  and  bottom-up  algorithms. 

The  top-down  node-splitting  algorithm  uses  lock  coupling  to  apply  transformations  during  a  single 
traversal  of  the  tree  structure.  However,  processors  are  required  to  wait  for  replies  most  of  which  merely 
verify  that  no  transformation  occurred. 

The  introduction  of  side  links  between  adjacent  nodes  at  the  same  level  eliminates  the  need  for 
lock  coupling  and  permits  a  bottom-up  algorithm  to  be  used.  This  algorithm  allows  the  transformations 
resulting  from  changes  to  the  tree  structure  to  be  performed  asynchronously  from  the  leaf  nodes  upwards, 
while  guaranteeing  the  correctness  of  other  operations  concurrently  executing  on  the  data  structure.  The 
use  of  parent  pointers  to  give  an  indication  of  the  node  to  be  updated  at  a  higher  level  leads  to  improved 
performance  when  compared  to  the  stack-based  technique  used  in  other  B-tree  algorithms. 

In  a  series  of  simulations  conducted  for  both  algorithms,  the  bottom-up  approach  gives  significantly 
better  query  throughput  and  response  time.  The  number  of  upward  messages  (and  hence  the  contention) 
between  adjacent  processors  in  the  linear  array  is  significantly  less  for  the  bottom-up  algorithm.  Fur¬ 
thermore,  the  bottom-up  algorithm  shows  increasingly  superior  performance  relative  to  the  top-down 
algorithm  Z3  the  one-way  message  delay  between  adjacent  processors  increases.  Improved  performance 
is  also  achieved  for  the  bottom-up  algorithm  when  asynchronous  as  opposed  to  synchronous  message 
passing  is  used. 

The  bottom-up  algorithm  for  the  ( 2B~ 2, 2B)  search  tree  has  been  shown  to  provide  a  highly  efficient 
and  flexible  implementation  of  dictionary  abstract  data  types  on  message-passing  MIMD  architectures. 
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For  a  given  cost  ratio  of  computation  to  communication  the  value  of  B  may  be  varied  to  maximize 
performance.  The  algorithm  also  introduces  a  stabilizing  hysteresis  behavior  that  is  not  present  in  many 
other  balanced-tree  algorithms.  The  bottom-up  algorithm  described  here  has  the  best  performance  of 
any  of  the  implementations  of  search  trees  for  linear  arrays  described  to  date. 
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